
DevBench: A multimodal developmental benchmark for language learning

Neural Information Processing Systems

How (dis)similar are the learning trajectories of vision-language models and children? Recent modeling work has attempted to understand the gap between models' and humans' data efficiency by constructing models trained on less data, especially multimodal naturalistic data. However, such models are often evaluated on adult-level benchmarks, with limited breadth in language abilities tested, and without direct comparison to behavioral data. We introduce DevBench, a multimodal benchmark comprising seven language evaluation tasks spanning the domains of lexical, syntactic, and semantic ability, with behavioral data from both children and adults. We evaluate a set of vision-language models on these tasks, comparing models and humans on their response patterns, not their absolute performance. Across tasks, models exhibit variation in their closeness to human response patterns, and models that perform better on a task also more closely resemble human behavioral responses. We also examine the developmental trajectory of OpenCLIP over training, finding that greater training results in closer approximations to adult response patterns. DevBench thus provides a benchmark for comparing models to human language development. These comparisons highlight ways in which model and human language learning processes diverge, providing insight into entry points for improving language models.
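
The response-pattern comparison can be made concrete with a minimal sketch (the item, the choice proportions, and the KL-based distance below are illustrative assumptions, not DevBench's published metric): compare the distribution of human choices on an item with the model's probability mass over the same options.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over answer options."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 4-alternative forced-choice item: proportion of humans
# choosing each picture vs. the model's probability over the same pictures.
human = [0.70, 0.15, 0.10, 0.05]
model = [0.55, 0.25, 0.10, 0.10]
print(f"KL(human || model) = {kl_divergence(human, model):.3f}")
```

Averaging such item-level divergences over a task yields a single distance to human response patterns, the kind of scalar the closeness comparisons above are built on.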


Manifolds and Modules: How Function Develops in a Neural Foundation Model

Bertram, Johannes, Dyballa, Luciano, Keller, T. Anderson, Kinger, Savik, Zucker, Steven W.

arXiv.org Artificial Intelligence

Foundation models have shown remarkable success in fitting biological visual systems; however, their black-box nature inherently limits their utility for understanding brain function. Here, we peek inside a state-of-the-art foundation model of neural activity (Wang et al., 2025) as a physiologist might, characterizing each 'neuron' based on its temporal response properties to parametric stimuli. We analyze how different stimuli are represented in neural activity space by building decoding manifolds, and we analyze how different neurons are represented in stimulus-response space by building neural encoding manifolds. We find that the different processing stages of the model (i.e., the feedforward encoder, recurrent, and readout modules) each exhibit qualitatively different representational structures in these manifolds. The recurrent module shows a jump in capabilities over the encoder module by 'pushing apart' the representations of different temporal stimulus patterns, while the readout module achieves biological fidelity by using numerous specialized feature maps rather than biologically plausible mechanisms. Overall, we present this work as a study of the inner workings of a prominent neural foundation model, gaining insights into the biological relevance of its internals through a novel analysis of its neurons' joint temporal response patterns.
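
To make the two manifold constructions concrete, here is a minimal sketch under an assumed data layout (the array shapes, the random stand-in data, and the choice of Isomap are all assumptions, not the authors' pipeline): stimuli become points in neural activity space for the decoding manifold, and neurons become points in stimulus-response space for the encoding manifold.

```python
import numpy as np
from sklearn.manifold import Isomap

# Assumed layout: trial-averaged responses of shape
# (n_neurons, n_stimuli, n_timepoints); random data stands in for model units.
rng = np.random.default_rng(0)
responses = rng.standard_normal((200, 50, 30))
n_neurons, n_stimuli, n_time = responses.shape

# Decoding manifold: each stimulus is a point in neural activity space.
stim_features = responses.transpose(1, 0, 2).reshape(n_stimuli, -1)
decoding = Isomap(n_components=3).fit_transform(stim_features)

# Encoding manifold: each neuron is a point in stimulus-response space.
neuron_features = responses.reshape(n_neurons, -1)
encoding = Isomap(n_components=3).fit_transform(neuron_features)

print(decoding.shape, encoding.shape)  # (50, 3) (200, 3)
```

Comparing such embeddings module by module (encoder, recurrent, readout) is what reveals the qualitatively different representational structures described above.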



Markov Missing Graph: A Graphical Approach for Missing Data Imputation

Yang, Yanjiao, Chen, Yen-Chi

arXiv.org Machine Learning

We introduce the Markov missing graph (MMG), a novel framework that imputes missing data based on undirected graphs. MMG leverages conditional independence relationships to locally decompose the imputation model. To establish identification, we introduce the Principle of Available Information (PAI), which guides the use of all relevant observed data. We then propose a flexible statistical learning paradigm, MMG Imputation Risk Minimization under PAI, that frames the imputation task as an empirical risk minimization problem. This framework is adaptable to various modeling choices. We develop the theory of MMG, including the connection between MMG and Little's complete-case missing value assumption, recovery under missing completely at random, efficiency theory, and graph-related properties. We show the validity of our method with simulation studies and illustrate its application with a real-world Alzheimer's data set.
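
The local-decomposition idea admits a rough sketch (an illustration of imputing each variable from its graph neighbors on available rows, not the paper's estimator; the graph, the linear models, and the crude PAI approximation are all assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical undirected graph over variables (each variable's neighbors).
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

def impute_from_neighbors(df, graph):
    """Impute each column from its graph neighbors, training on rows where
    the column and its neighborhood are observed (a stand-in for PAI)."""
    out = df.copy()
    for col, nbrs in graph.items():
        miss = df[col].isna()
        if not miss.any():
            continue
        train = df.dropna(subset=[col] + nbrs)  # use available information
        model = LinearRegression().fit(train[nbrs], train[col])
        ready = miss & df[nbrs].notna().all(axis=1)
        out.loc[ready, col] = model.predict(df.loc[ready, nbrs])
    return out

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "C"])
df.loc[rng.choice(100, 15, replace=False), "A"] = np.nan
print(impute_from_neighbors(df, neighbors).isna().sum())
```

The graphical decomposition keeps each imputation model low-dimensional: a variable's model conditions only on its neighbors rather than on the full variable set.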



Artificial Finance: How AI Thinks About Money

Erdem, Orhan, Ashok, Ragavi Pobbathi

arXiv.org Artificial Intelligence

In this paper, we explore how large language models (LLMs) approach financial decision-making by systematically comparing their responses to those of human participants across the globe. We posed a set of commonly used financial decision-making questions to seven leading LLMs, including five models from the GPT series (GPT-4o, GPT-4.5, o1, o3-mini), Gemini 2.0 Flash, and DeepSeek R1. We then compared their outputs to human responses drawn from a dataset covering 53 nations. Our analysis reveals three main results. First, LLMs generally exhibit a risk-neutral decision-making pattern, favoring choices aligned with expected value calculations when faced with lottery-type questions. Second, when evaluating trade-offs between present and future, LLMs occasionally produce responses that appear inconsistent with normative reasoning. Third, when we examine cross-national similarities, we find that the LLMs' aggregate responses most closely resemble those of participants from Tanzania. These findings contribute to the understanding of how LLMs emulate human-like decision behaviors and highlight potential cultural and training influences embedded within their outputs.
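
The risk-neutrality finding is easy to state as a worked example (the lottery amounts are hypothetical, not items from the paper's questionnaire): a risk-neutral agent picks whichever option has the higher expected value.

```python
# Hypothetical lottery-type question: $450 for sure, or a 50% chance of $1000?
p_win, prize, sure_amount = 0.5, 1000.0, 450.0

expected_value = p_win * prize  # 0.5 * 1000 = 500
choice = "lottery" if expected_value > sure_amount else "sure amount"
print(f"EV = {expected_value:.0f} -> risk-neutral choice: {choice}")
# Risk-averse humans often take the sure $450 despite its lower expected
# value; the paper reports LLMs tending toward the EV-maximizing option.
```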



Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments

Strugatski, Alona, Alexandron, Giora

arXiv.org Artificial Intelligence

Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions (MCQs) in assessments, the detection of AI cheating in MCQ-based tests has been almost unexplored, in contrast to the focus on detecting AI cheating in text-rich student outputs. In this paper, we propose a method based on the application of Item Response Theory (IRT) to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics (PFS). We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but that it is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
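
To illustrate the person-fit idea, here is a minimal sketch of the standardized log-likelihood statistic l_z under a 2PL IRT model (the item parameters, ability value, and response vector are illustrative, and the paper does not commit to this particular PFS):

```python
import numpy as np

def lz_person_fit(x, a, b, theta):
    """Standardized log-likelihood person-fit statistic l_z under a 2PL model.
    x: 0/1 responses; a, b: item discrimination/difficulty; theta: ability."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, a, b))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))            # P(correct | theta)
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))  # observed log-lik.
    mean = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - mean) / np.sqrt(var)

# Five illustrative items, easy to hard; this pattern (easy items missed,
# hard items solved) is aberrant for an examinee of average ability.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])
aberrant = np.array([0, 0, 1, 1, 1])
print(f"l_z = {lz_person_fit(aberrant, a, b, theta=0.0):.2f}")
```

Strongly negative l_z values flag response vectors that misfit the expected human pattern, which is how deviations such as AI-generated answer sheets would surface.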